Safe and Efficient Off-Policy Reinforcement Learning

Authors

  • Rémi Munos
  • Tom Stepleton
  • Anna Harutyunyan
  • Marc G. Bellemare
Abstract

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of “off-policyness”; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q∗ without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins’ Q(λ), which had been an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.

One fundamental trade-off in reinforcement learning lies in the definition of the update target: should one estimate Monte Carlo returns or bootstrap from an existing Q-function? Return-based methods (where “return” refers to the sum of discounted rewards ∑_t γ^t r_t) offer some advantages over value-bootstrap methods: they are better behaved when combined with function approximation, and they quickly propagate the fruits of exploration (Sutton, 1996). On the other hand, value-bootstrap methods are more readily applied to off-policy data, a common use case. In this paper we show that learning from returns need not be at cross-purposes with off-policy learning.

We start from the recent work of Harutyunyan et al. (2016), who show that naive off-policy policy evaluation, without correcting for the “off-policyness” of a trajectory, still converges to the desired Q^π value function provided the behaviour policy μ and the target policy π are not too far apart (the maximum allowed distance depends on the λ parameter). Their Q(λ) algorithm learns from trajectories generated by μ simply by summing discounted off-policy-corrected rewards at each time step. Unfortunately, the assumption that μ and π are close is restrictive, as well as difficult to uphold in the control case, where the target policy is greedy with respect to the current Q-function. In that sense this algorithm is not safe: it does not handle the case of arbitrary “off-policyness”. Alternatively, the Tree-backup (TB(λ)) algorithm (Precup et al., 2000) tolerates arbitrary target/behaviour discrepancies by scaling the information (here called traces) from future temporal differences by the product of target-policy probabilities. TB(λ) is not efficient in the “near on-policy” case (similar μ and π), though, as traces may be cut prematurely, blocking learning from full returns.

In this work, we express several off-policy, return-based algorithms in a common form. From this we derive an improved algorithm, Retrace(λ), which is both safe and efficient, enjoying convergence guarantees for off-policy policy evaluation and, more importantly, for the control setting. Retrace(λ) can learn from full returns retrieved from past policy data, as in the context of experience replay (Lin, 1993), which has returned to favour with advances in deep reinforcement learning (Mnih et al., 2015; Schaul et al., 2016).
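To make the common form concrete, the sketch below is a minimal tabular illustration in Python/NumPy (hypothetical data layout: q is a 2-D array of state-action values, pi and mu are arrays of per-state action probabilities, and the trajectory is a list of (state, action, reward) tuples generated by μ). It writes the off-policy correction as a sum of trace-weighted temporal differences, ΔQ(x_0, a_0) = ∑_t γ^t (c_1⋯c_t) δ_t with δ_t = r_t + γ E_π[Q(x_{t+1}, ·)] − Q(x_t, a_t), and contrasts the trace coefficients associated with importance sampling, Q(λ), TB(λ), and Retrace(λ)'s truncated importance weights. It is an illustration of the form discussed above, not the paper's full online algorithm.

    import numpy as np

    def off_policy_return_update(q, trajectory, pi, mu, gamma=0.99, lam=1.0,
                                 trace="retrace"):
        """Return-based off-policy correction for one trajectory (tabular sketch).

        Computes dQ(x_0, a_0) = sum_t gamma^t * (c_1 * ... * c_t) * delta_t, with
        delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t); the trace
        coefficients c_s encode the algorithm. The end of the list is treated as
        terminal for simplicity (hypothetical layout; sketch only).
        """
        x0, a0, _ = trajectory[0]
        correction, coeff = 0.0, 1.0
        for t, (x, a, r) in enumerate(trajectory):
            if t > 0:
                if trace == "is":         # plain importance sampling
                    c = pi[x][a] / mu[x][a]
                elif trace == "qlambda":  # Q(lambda): no off-policy correction of the trace
                    c = lam
                elif trace == "tb":       # Tree-backup: scale by target-policy probability
                    c = lam * pi[x][a]
                else:                     # Retrace: truncated importance weight
                    c = lam * min(1.0, pi[x][a] / mu[x][a])
                coeff *= c
                if coeff == 0.0:          # trace fully cut: later terms contribute nothing
                    break
            x_next = trajectory[t + 1][0] if t + 1 < len(trajectory) else None
            expected_q = 0.0 if x_next is None else float(np.dot(pi[x_next], q[x_next]))
            delta = r + gamma * expected_q - q[x][a]
            correction += (gamma ** t) * coeff * delta
        return (x0, a0), correction

When μ and π are similar, the truncated weight stays near λ, so full returns are used much as in the on-policy case; when they differ arbitrarily, the truncation at 1 keeps the product of traces, and hence the variance of the correction, from blowing up, which is the sense in which the method is both efficient and safe.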
Off-policy learning is also desirable for exploration, since it allows the agent to deviate from the target policy currently under evaluation. To the best of our knowledge, this is the first online return-based off-policy control algorithm which does not require the GLIE (Greedy in the Limit with Infinite Exploration) assumption (Singh et al., 2000). In addition, we provide as a corollary the first proof of convergence of Watkins’ Q(λ) (see, e.g., Watkins, 1989; Sutton and Barto, 1998). Finally, we illustrate the significance of Retrace(λ) in a deep learning setting by applying it to the suite of Atari 2600 games provided by the Arcade Learning Environment (Bellemare et al., 2013).
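The corollary about Watkins’ Q(λ) can be read off the trace coefficients: when the target policy is greedy with respect to the current Q-function, a truncated importance weight of the form λ min(1, π(a|x)/μ(a|x)) equals λ for the greedy action and 0 otherwise, so the trace is cut at the first exploratory action, which is exactly Watkins’ rule. Below is a minimal sketch of that reduction (hypothetical per-state vectors q_row of Q-values and mu_row of behaviour probabilities; illustrative only).

    import numpy as np

    def greedy_target_trace(q_row, a, mu_row, lam):
        """Trace coefficient lam * min(1, pi(a|x)/mu(a|x)) for a greedy target pi.

        With pi greedy w.r.t. q_row, pi(a|x) is 1 for the argmax action and 0
        otherwise, so the coefficient reduces to lam * 1{a is greedy}: taking a
        non-greedy action cuts the trace, as in Watkins' Q(lambda). Sketch only,
        with hypothetical per-state vector inputs.
        """
        pi_a = 1.0 if a == int(np.argmax(q_row)) else 0.0
        return lam * min(1.0, pi_a / mu_row[a]) if mu_row[a] > 0.0 else 0.0

The Tree-backup coefficient λπ(a|x) collapses to the same rule for a greedy target, which is consistent with Watkins’ Q(λ) being covered as a special case of the control analysis mentioned above.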

Related articles

Stochastic Motion Planning for Hopping Rovers on Small Solar System Bodies

Hopping rovers have emerged as a promising platform for the future surface exploration of small Solar System bodies, such as asteroids and comets. However, hopping dynamics are governed by nonlinear gravity fields and stochastic bouncing on highly irregular surfaces, which pose several challenges for traditional motion planning methods. This paper presents the first ever discussion of motion pl...

Convergent Tree-Backup and Retrace with Function Approximation

Off-policy learning is key to scaling up reinforcement learning, as it allows learning about a target policy from experience generated by a different behavior policy. Unfortunately, it has been challenging to combine off-policy learning with function approximation and multi-step bootstrapping in a way that leads to both stable and efficient algorithms. In this paper, we show that the Tree Ba...

Sample Efficient On-Line Learning of Optimal Dialogue Policies with Kalman Temporal Differences

Designing dialog policies for voice-enabled interfaces is a tailoring job that is most often left to natural language processing experts. This job is generally redone for every new dialog task because cross-domain transfer is not possible. For this reason, machine learning methods for dialog policy optimization have been investigated during the last 15 years. Especially, reinforcement learning ...

Safety, Risk Awareness and Exploration in Reinforcement Learning

Safety, Risk Awareness and Exploration in Reinforcement Learning, by Teodor Mihai Moldovan; Doctor of Philosophy in Computer Science, University of California, Berkeley; Professor Pieter Abbeel, Chair. Replicating the human ability to solve complex planning problems based on minimal prior knowledge has been extensively studied in the field of reinforcement learning. Algorithms for discrete or approx...

Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have ...

Publication date: 2016